ABSTRACT

In order to screen out items which may be biased
against some ethnic group prior to the final selection of items in
test construction, a statistical technique for assessing item bias
was developed. Based on a theoretical formulation of R. B.
Darlington, the method compares the performance of individuals who
belong to different ethnic groups but have equal scores on a subtest
containing the item. A chi square technique is used to evaluate
differences in performance and is demonstrated on data collected
during the item analysis phase of the current revision of the
Metropolitan Readiness Tests. (Author)
A NEW METHOD OF ASSESSING BIAS IN TEST ITEMS

Janice Scheuneman
Harcourt Brace Jovanovich, Inc.

While test bias is defined in many ways and must ultimately be
assessed in terms of how a test is to be used, it would seem desirable in the
construction of new measuring instruments to screen out items which are likely
to be biased before assembling the final forms of the test. In the 1976
revision of the Metropolitan Readiness Tests (MRT) a strong effort was made
to eliminate biased items during the item analysis process. Items were
reviewed for possible bias in the content by the authors, staff members and
minority group consultants, but it was felt a more rigorous statistical
procedure was also required.

Where criterion measures are not available, bias is most frequently
defined as item by group interaction. That is, groups under consideration may
not be equivalent in the ability being measured, but in an unbiased test the
differences between them are expected to be consistent across items. Procedures
have been developed to determine which items are contributing most to
an interaction, if it exists, but the focus is on the test or subtest as a
whole rather than on the items. (See, for example, Cleary & Hilton, 1968,
and Angoff & Ford, 1973.) With a large item pool grouped somewhat arbitrarily
into experimental test forms, however, there is little interest in the subtest
as a whole. The question in this case is whether an item would be biased when
placed into some set of items measuring the same skill. The item-group
interaction procedures were not designed to address this question. Therefore,
a new technique for assessing item bias was developed.

In this study, an item is considered unbiased if, for persons with
the same ability in the area being measured, the probability of a correct
response on the item is the same regardless of the population group membership
of the individual. Assuming that the subtest score is a reasonably valid
measure of the ability in question, this definition can be stated in more
operational terms as follows: An item is unbiased if, for all individuals
having the same score on a homogeneous subtest containing the item, the proportion
of individuals getting the item correct is the same for each population group
being considered. Using this definition and standard statistical techniques,
it is possible to determine the probability that any item in the pool is unbiased.
Where the probability is sufficiently low, the item is discarded.
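
The operational definition above can be illustrated with a short sketch that tabulates, for each subtest score, the proportion of each population group answering an item correctly. The record layout, the function name, and the toy data are assumptions made for illustration; they are not taken from the study.

```python
from collections import defaultdict

def proportion_correct_by_score(records):
    """Tabulate proportion correct on one item by subtest score and group.

    records: iterable of (group, subtest_score, item_correct) tuples
             (a hypothetical layout, not the study's data format).
    Returns {score: {group: proportion_correct}}.
    """
    # counts[score][group] = [number correct, number of children]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for group, score, correct in records:
        cell = counts[score][group]
        cell[0] += int(bool(correct))
        cell[1] += 1
    return {
        score: {g: c / n for g, (c, n) in groups.items()}
        for score, groups in counts.items()
    }

# Hypothetical toy data: two groups, two subtest score levels.
records = [
    ("A", 10, 1), ("A", 10, 0), ("B", 10, 1), ("B", 10, 0),
    ("A", 15, 1), ("A", 15, 1), ("B", 15, 1), ("B", 15, 0),
]
table = proportion_correct_by_score(records)
```

Under the definition, an unbiased item would show roughly equal proportions for the two groups at every score level. In the toy data the groups agree at a score of 10 but differ at a score of 15.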

Method

Data were gathered during the fall 1973 item analysis program for the 
Metropolitan Readiness Tests. The experimental version of the MRT used for
item analysis consisted of 14 forms, seven at each of two levels, Kindergarten 
and beginning Grade 1, with four to six subtests on each form. The Level I 
tests consisted of 28 subtests in nine areas while Level II included 37 subtests 
in twelve areas for a total of 65 subtests. The subtests were designed to measure 
various aspects of visual discrimination, auditory discrimination, or language
proficiency. The sample was not intended to be nationally representative, but 
was carefully selected to cover a wide range of community sizes, socio-economic 
indices and geographic regions. A total of approximately 10,500 children, or 
about 750 per form, were involved in the program. 

The MRT was administered by classroom teachers with each class 
receiving only one form of the test. The teachers were instructed to work with 
small groups of children so that they could become aware of any difficulties
with marking, handling the test booklet, or similar problems. The items were
read aloud; in cases where no reading was required (e.g. Visual Matching) the 
items were paced to make sure each child had a chance to respond to each item. 

The booklets were machine scored and the data analyzed by computer. 
The cross-tabulations for each item (number of correct responses by subtest
score and population group) and other data necessary for the bias study were 
provided as part of the output. Each item was then tested for bias using a 
2 X r chi square technique, where there were two population groups and r score 
groups. It would be possible, though perhaps not desirable, to do the analysis
with more than two population groups at a time, but only two groups in this
study, Blacks and Whites, had a sufficient number of children on any one form
to be analyzed. Population group membership was identified by each child's
teacher. The value of r varied from item to item, because it was necessary to
combine adjacent score groups, particularly at the extremes of the distribution,
in order to get large enough expected frequencies in all cells. The number of
score groups required varied with the difficulty of the item and the length of 
the subtest. When the probability of the obtained chi square value for an item 
fell below .30, it was recommended that the item be dropped from the pool. 
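
The report does not spell out the chi square formula, so the following is only a sketch of one plausible reading. Within each score group, the expected number of correct responses for a population group is taken to be the pooled proportion correct times the number of children from that group, and the squared deviations are summed over the 2 x r cells. The choice of r - 1 degrees of freedom, the function names, and the counts are assumptions for illustration.

```python
import math

def chi_square_sf(x, df):
    """Upper-tail probability for a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term = total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite sum over half-integer powers.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (i + 0.5)
    return total

def item_bias_chi_square(correct, total):
    """correct[g][j], total[g][j]: counts for population group g, score group j.

    Score groups are assumed to have been combined already so that every
    expected frequency is greater than zero.
    """
    n_groups, r = len(correct), len(correct[0])
    chi2 = 0.0
    for j in range(r):
        n_j = sum(total[g][j] for g in range(n_groups))
        c_j = sum(correct[g][j] for g in range(n_groups))
        p_j = c_j / n_j  # pooled proportion correct in this score group
        for g in range(n_groups):
            expected = p_j * total[g][j]
            chi2 += (correct[g][j] - expected) ** 2 / expected
    df = r - 1  # assumption: the report does not state the degrees of freedom
    return chi2, df, chi_square_sf(chi2, df)

CUTOFF = 0.30  # the cutoff quoted in the text

# Hypothetical counts: 2 population groups by 3 score groups.
total = [[10, 10, 10], [20, 20, 20]]
correct = [[5, 3, 2], [18, 14, 8]]
chi2, df, prob = item_bias_chi_square(correct, total)
flagged = prob < CUTOFF
```

With these made-up counts the first group answers correctly less often than the second at every score level, so the item would be flagged under the .30 cutoff.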

Results 

Using data from other parts of the item analysis program, seven 
subtest areas at each of the two levels were selected for inclusion in the final 
forms of the MRT. Therefore, only items in the 44 subtests in these areas were 
screened for bias. Excluding a number of items within these subtests which were 
dropped for other reasons, the chi square tests were performed on a final pool of 
579 items. Of these, eighty, or about 14 percent, were termed biased by
this procedure.

Table I summarizes the results by ability area. It can be seen that
the majority of biased items fell into the Language area. Of these, 23 items,
or 29 percent of all biased items, were from Quantitative subtests although
the 110 Quantitative items made up only 19 percent of the total item pool. The
fewest biased items came from the Visual area. In the Visual Matching
subtests at Level II, only one out of every 45 items was called biased.
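
As a check on the arithmetic, the percentages quoted here and in Table I can be recomputed from the reported counts. The counts below are those given in the text and in Table I; the variable and function names are only illustrative.

```python
# (biased items, items tested) by area, as reported in Table I.
table_1 = {"Language": (49, 287), "Visual": (15, 154), "Auditory": (16, 138)}

total_biased = sum(b for b, _ in table_1.values())
total_tested = sum(t for _, t in table_1.values())

def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)

share_of_biased = {area: pct(b, total_biased) for area, (b, _) in table_1.items()}
share_of_pool = {area: pct(t, total_tested) for area, (_, t) in table_1.items()}

# Quantitative subtests: 23 biased items out of 110 items tested.
quant_share_of_biased = pct(23, total_biased)
quant_share_of_pool = pct(110, total_tested)
overall_biased = pct(total_biased, total_tested)
```

The recomputed values agree with the figures in the text: 61, 19 and 20 percent of the biased items, 50, 27 and 24 percent of the pool, and 29 versus 19 percent for the Quantitative items.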

The content of the biased items was then reexamined to determine
possible reasons for bias, but in most cases the cause of the bias was not
immediately obvious. In the case of items from the School Language tests at
Level II, however, the biased items tended to fall into a pattern. Of the 55
School Language items tested, ten involved some negative structures. For example,
"Mark the thing that is unopened." or "Mark the picture which shows neither a
cat nor a dog." Of the seven items found to be biased in these subtests, six
involved the negative forms, while a seventh item involving negatives was only 
slightly above the cut-off point. This pattern seemed too strong to ignore even 
though the apparent direction of the bias was not consistent. For this reason, 
all ten items involving negatives were dropped from the item pool. 

The chi square test treats a deviation from equality in either direction 
as equivalent, so in order to infer the direction of the bias, it was necessary 
to look at the performance of children at each score level in more detail. In 
doing so, four distinct patterns emerged. 

1. The item was apparently biased against Blacks (Group A). 

2. The item was apparently biased against Whites (Group B). 

3. The item was probably not biased at all. If adjacent
score categories were combined into still larger categories, 
the difference between groups tended to disappear. 

4. For high scoring children the difference was in one 
direction and for low scoring children it was in the 
other. 

Further examination of items displaying the fourth pattern, which will be 
called the "differential validity pattern," revealed that the point-biserial 
correlations between the item and the subtest total score for the two population 
groups were quite different. With these items a large proportion of high 
scoring children and a small proportion of low scoring children from one group 
got the item correct as would be expected. In the other group, however, the 
change in proportion from high to low scorers was relatively slight. 
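
The comparison of point-biserial correlations described above can be sketched as follows. The point-biserial is computed here as the ordinary Pearson correlation between a 0/1 item score and the subtest total, separately for each population group. The function name and the toy data are assumptions for illustration and are not taken from the study.

```python
import math

def point_biserial(item, scores):
    """Pearson correlation between a 0/1 item score and the subtest total.

    Assumes both lists have the same length and nonzero variance.
    """
    n = len(item)
    mean_x = sum(item) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item, scores))
    var_x = sum((x - mean_x) ** 2 for x in item)
    var_y = sum((y - mean_y) ** 2 for y in scores)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: the same subtest totals in both groups.
scores = [1, 2, 3, 4]
item_group_a = [1, 0, 0, 1]  # success on the item unrelated to total score
item_group_b = [0, 0, 1, 1]  # success on the item rises with total score

r_a = point_biserial(item_group_a, scores)
r_b = point_biserial(item_group_b, scores)
difference = abs(r_a - r_b)
```

In this toy case the item hardly separates high and low scorers in the first group but separates them sharply in the second, which is the kind of contrast the differential validity pattern describes.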

In order to illustrate the types of patterns which emerged, items
were selected from one subtest, the Quantitative subtest from Booklet 8,
Level II. The performance of the two population groups on this subtest is
summarized in Table II, with results from the sample items shown in Table III.
For each item, Table III provides the difficulty value (proportion of correct
responses); the point-biserial correlation with the subtest total; the percent
correct responses in each score group, separately for the two population
groups; and the chi square statistics. Item 3 shows a typical differential
validity pattern. The distribution of correct responses across score groups
is relatively flat for Group A, so that, comparatively, Group B scores much better
than expected at the highest score level and less well at the lowest score level.
Item 8 is an unbiased item presented for contrast with the others. Item 9 shows
that Group B does consistently less well than Group A except for a small
difference in the opposite direction for the lowest scoring group. In item 10,
Group A does consistently less well than Group B.

Discussion 

Although the results of the bias study reported here probably do not 
warrant any generalizable conclusions, there are some observations which can
be made. First, if a distribution were made of the differences between the 
item difficulty values of the two population groups for any one subtest, the 
items which were found to be clearly biased against one of these groups would 
tend to lie at the extremes of such distributions. The magnitude of the difference 
required for an item to appear at such extremes varies from subtest to subtest, 
but it is likely that the same items would have appeared using other techniques. 
It should be noted, however, that several subtests had no biased items and many 
had biased items at only one extreme. 

In the case of items displaying the differential validity pattern, 
however, this was not so. Many of these items showed differences in difficulty 
values which were only moderate for their subtest. According to the definition 
of bias presented here, these items are biased, if not clearly in favor of one
group or another. Yet a method which uses difficulty value, even in some
transformation or in relation to other items, as the sole variable reflecting 
bias is apt to miss these items. 

It is possible that a large difference in point-biserial correlation 
or other indicator of internal consistency could be used as a separate screening 
device, but the question then arises of how large a difference would justify
dropping an item. In this study, all six items where the difference between
the point-biserial r's exceeded .30 were found to be biased. However, for a
number of items displaying the differential validity pattern, the differences 
were between .20 and .30, a range which also contained a number of items which 
were evidently not biased. The chi square test would appear to provide a 
meaningful criterion for determining when the departure between groups becomes 
significant. 

Another observation concerns the number of score groups formed in
doing the chi square tests. In this study, the number of groups was generally
kept as high as the different distributions permitted. However, this procedure
does not appear to be optimal. The results which seemed to be spurious
occurred most often if the number of score groups was either very low or very
high. On the basis of experience, four to six groups would seem to be most
satisfactory if the data allow that many to be formed.
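
As an illustration of forming score groups, the sketch below assigns raw subtest scores to the four bands that the footnote to Table III reports for items 8-10 (Lo 3-7, Med-Lo 8-9, Med-Hi 10-12, Hi 13-22) and counts the children in each band. The function names and the sample scores are hypothetical; how the bands were actually chosen for each item is not described in enough detail to reproduce here.

```python
from bisect import bisect_right
from collections import Counter

# Lower bounds of the bands reported for items 8-10 in Table III.
LOWER_BOUNDS = [3, 8, 10, 13]
LABELS = ["Lo", "Med-Lo", "Med-Hi", "Hi"]

def score_band(score, lower_bounds=LOWER_BOUNDS, labels=LABELS):
    """Return the label of the band containing a raw subtest score.

    Scores below the lowest bound raise ValueError; no upper limit is
    enforced, so any score of 13 or more falls in the top band.
    """
    idx = bisect_right(lower_bounds, score) - 1
    if idx < 0:
        raise ValueError(f"score {score} is below the lowest band")
    return labels[idx]

def band_counts(scores):
    """Count how many scores fall into each band."""
    return Counter(score_band(s) for s in scores)

# Hypothetical raw scores for ten children.
scores = [3, 5, 7, 8, 9, 10, 12, 13, 18, 22]
counts = band_counts(scores)
n_bands = len(counts)
```

Counting the occupied bands in this way gives a quick check that the number of score groups falls in the four-to-six range suggested above.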

In conclusion, the chi square procedure described in this report 
appears to be a satisfactory technique for screening out items which are
likely to be biased before the final construction of an ability or achievement 
measure. 

TABLE I

Number of Biased Items by Area

                     Biased Items    Total Items Tested

Language      N           49               287
              %           61                50

Visual        N           15               154
              %           19                27

Auditory      N           16               138
              %           20                24

Total         N           80               579
              %          100               100

TABLE II

Quantitative Subtest
Level II Booklet 8
23 Items

               Population Group

                 A          B

N               103        815
Mean           9.12      13.01
s.d.           3.00       3.61
Q3            11.18      15.49
Med.           8.94      13.11
Q1             6.88      10.47
Range          3-18       3-22

TABLE III

Examples of Items
Quantitative Subtest

                                       % of Correct Responses*
      Population  Diffi-  Pt.              Med-   Med-         Chi
Item  Group       culty   Biserial   Hi    Hi     Lo     Lo    Square   df   Prob**

  3   A           .29     .09        36    29     25     27     4.66         <.20
      B           .56     .48        70    30     32     17

  8   A           .50     .43        92    68     37     27     0.70         >.80
      B           .78     .48        93    72     50     25

  9   A           .50     .34        75    65     54     21     6.34         <.10
      B           .53     .33        65    43     25     28

 10   A           .40     .42        83    50     33     18     6.00         <.20
      B           .78     .42        94    68     61     32

 * Hi     - scores of 12-22 for item 3; 13-22 for items 8-10
   Med-Hi - scores of 10-11 for item 3; 10-12 for items 8-10
   Med-Lo - scores of 8-9
   Lo     - scores of 3-7

** The probability that the item is not biased.